Sparsity analysis of term weighting schemes: Application to Feature Selection

Authors

  • Natasa Milic-Frayling
  • Dunja Mladenic
  • Janez Brank
  • Marko Grobelnik
Abstract

In this paper we revisit the practice of using feature selection for dimensionality and noise reduction. Commonly, features are scored according to some weighting scheme and then the top N ranked features, or the top N percent of scored features, are retained for further processing. In text classification, such a selection criterion leads to significantly different sizes of (unique) feature sets across weighting schemes if a particular level of performance is to be achieved with a given learning method. On the other hand, the number and type of selected features determine the sparsity characteristics of the training and test documents, i.e., the average number of features per document vector. We show that specifying a sparsity level, instead of a pre-defined number of features per category, as the selection criterion produces comparable average performance over the set of categories. At the same time, it has the obvious advantage of providing a means to control the consumption of computing memory resources. Furthermore, we show that observing the sparsity characteristics of selected feature sets, in the form of sparsity curves, can be useful in understanding the nature of the feature weighting scheme itself. In particular, we begin to understand the level at which feature specificity, commonly called 'rarity', is incorporated into the term weighting scheme and accounted for by the learning algorithm.
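
As a rough illustration of the two selection criteria contrasted above, the following sketch (plain NumPy on a small dense presence matrix; all names are illustrative, not taken from the paper) selects features either by a fixed top-N cut-off or by growing the ranked feature set until the documents reach a target average number of features per vector, i.e., the desired sparsity level:

    import numpy as np

    def select_top_n(scores, n):
        """Conventional criterion: indices of the n highest-scoring features."""
        return np.argsort(scores)[::-1][:n]

    def select_by_sparsity(X, scores, target_sparsity):
        """Sparsity criterion: extend the ranked feature set until the
        average number of retained features per document reaches
        target_sparsity. X is a (documents x features) 0/1 matrix."""
        ranked = np.argsort(scores)[::-1]
        selected = []
        for f in ranked:
            selected.append(f)
            if X[:, selected].sum() / X.shape[0] >= target_sparsity:
                break
        return np.array(selected)

    # Toy data: 4 documents, 6 features, made-up feature scores.
    X = np.array([[1, 0, 1, 0, 1, 0],
                  [1, 1, 0, 0, 0, 1],
                  [0, 1, 1, 1, 0, 0],
                  [1, 0, 0, 1, 1, 1]])
    scores = np.array([0.9, 0.4, 0.7, 0.2, 0.8, 0.5])

    print(select_top_n(scores, 3))             # -> [0 4 2]
    print(select_by_sparsity(X, scores, 2.0))  # -> [0 4 2 5]

The second criterion bounds memory consumption directly: the expected number of stored (document, feature) entries is simply the number of documents times the target sparsity, regardless of which weighting scheme produced the scores.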

Related articles

Investigation of Term Weighting Schemes in Classification of Imbalanced Texts

The class imbalance problem plays a critical role in the use of machine learning methods for text classification, since feature selection methods, like the learning methods themselves, expect a homogeneous class distribution. This study investigates two kinds of feature selection metrics (one-sided and two-sided) as the global component of term weighting schemes (referred to as tffs) in scenarios where...
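
The tffs schemes mentioned above couple term frequency with a feature selection metric as the global factor, in place of the usual idf; a minimal sketch of the idea, assuming chi-square as the global component (function names and numbers are illustrative, not the study's exact formulation):

    def chi_square(n11, n10, n01, n00):
        """Chi-square score of a term/category contingency table: n11 =
        category documents containing the term, n10 = category documents
        without it, n01/n00 = the same counts outside the category."""
        n = n11 + n10 + n01 + n00
        num = n * (n11 * n00 - n10 * n01) ** 2
        den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        return num / den if den else 0.0

    def tffs_weight(tf, fs_score):
        """tffs weighting: term frequency scaled by a feature selection
        score instead of the idf global factor."""
        return tf * fs_score

    print(tffs_weight(tf=3, fs_score=chi_square(40, 10, 60, 890)))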


A Framework for Characterizing Feature Weighting and Selection Methods in Text Classification

Optimizing the performance of classification models often involves feature selection, to eliminate noise from the feature set or to reduce computational complexity by controlling the dimensionality of the feature space. A refinement of the feature set is typically performed in two steps: by scoring and ranking the features, and then applying a selection criterion. Empirical studies that explore the effe...
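
The two-step refinement described here separates feature scoring from the selection criterion; a minimal sketch of that decoupling (hypothetical names, toy scoring):

    from typing import Callable, Sequence

    def refine_features(features: Sequence[str],
                        score: Callable[[str], float],
                        criterion: Callable[[list], list]) -> list:
        """Step 1: score and rank the features; step 2: apply an
        interchangeable criterion (top-N, percentile, sparsity, ...)."""
        ranked = sorted(features, key=score, reverse=True)
        return criterion(ranked)

    # Toy example: rank by length, keep the top two.
    kept = refine_features(["market", "the", "inflation", "of"],
                           score=len,
                           criterion=lambda ranked: ranked[:2])
    print(kept)  # ['inflation', 'market']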


Feature Selection for the Classification of Large Document Collections

Feature selection methods are often applied in the context of document classification. They are particularly important for processing large data sets that may contain millions of documents and are typically represented by a large number, possibly tens of thousands, of features. Processing large data sets thus raises the issue of computational resources, and we often have to find the right trade-o...


A Novel Scheme for Improving Accuracy of KNN Classification Algorithm Based on the New Weighting Technique and Stepwise Feature Selection

The k-nearest-neighbor algorithm is one of the most frequently used techniques in data mining, owing to its integrity and performance. Though the KNN algorithm is highly effective in many cases, it has some essential deficiencies that affect its classification accuracy. First, the effectiveness of the algorithm is hurt by redundant and irrelevant features. Furthermore, this algori...
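
The abstract is cut off before the proposed technique is specified; purely as background on the idea of weighting neighbours, here is a standard inverse-distance-weighted KNN vote (not the paper's scheme; all names are illustrative):

    import numpy as np
    from collections import defaultdict

    def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-9):
        """Classify x by an inverse-distance-weighted vote among its
        k nearest training points, so closer neighbours count more."""
        dists = np.linalg.norm(X_train - x, axis=1)
        votes = defaultdict(float)
        for i in np.argsort(dists)[:k]:
            votes[y_train[i]] += 1.0 / (dists[i] + eps)
        return max(votes, key=votes.get)

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = ["a", "a", "b", "b"]
    print(weighted_knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 'a'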


Uncorrelated Group LASSO

The ℓ2,1-norm is an effective regularization for enforcing simple group sparsity in feature learning. To capture subtler structures among feature groups, we propose a new regularization called the exclusive group ℓ2,1-norm. It enforces sparsity at the intra-group level by using the ℓ2,1-norm, while encouraging the selected features to distribute across different groups by using the ℓ2-norm at the inter-grou...
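
For reference, the standard definition of the ℓ2,1-norm of a weight matrix W, alongside the common exclusive-lasso form of an intra-group-sparse penalty (shown as an assumption of what the exclusive regularizer looks like; the paper's exact definition may differ):

    \|W\|_{2,1} = \sum_{i=1}^{d} \|w^{i}\|_{2}
                = \sum_{i=1}^{d} \sqrt{\sum_{j} w_{ij}^{2}},
    \qquad
    \Omega_{\mathrm{excl}}(w) = \sum_{g \in \mathcal{G}} \|w_{g}\|_{1}^{2}

Here the ℓ1 factor inside each group drives intra-group sparsity, while summing the squared group norms acts like an ℓ2 penalty across groups, spreading the selected features over different groups.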


Journal:

Volume:   Issue:

Pages: -

Publication year: 2003